Datasets and Questions

Setup
The Udacity pickles were created with Python 2.x, so this notebook must run on a Python 2 kernel to load them. Creating a dedicated conda environment with Python 2 is one way to make Jupyter offer that kernel.
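As an aside: a reader stuck on Python 3 can still read a Python 2 pickle by passing `encoding='latin1'` to `pickle.load`/`pickle.loads`. A minimal self-contained sketch, simulating the situation with an in-memory round trip rather than the actual Udacity file (whose path and contents are not shown here):

```python
import pickle

# Protocol 2 is the highest protocol Python 2 understands, so a real
# Udacity pickle would be at protocol <= 2. We simulate one in memory.
sample = {'SKILLING JEFFREY K': {'salary': 1111258, 'poi': True}}
payload = pickle.dumps(sample, protocol=2)

# On Python 3, encoding='latin1' decodes Python 2 byte strings safely.
loaded = pickle.loads(payload, encoding='latin1')
print(loaded['SKILLING JEFFREY K']['salary'])  # 1111258
```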

In [13]:
# ensuring python version
import sys
sys.version
sys.version_info
Out[13]:
sys.version_info(major=2, minor=7, micro=15, releaselevel='final', serial=0)

Basic questions about the dataset

How many data points (people) are in the dataset?

In [14]:
from explore_enron_data import enron_data

total = len(enron_data)
total
Out[14]:
146

For each person, how many features are available?

This should be the length of the value of each key. Assuming every key (person) has the same number of features, let us check the first person.

In [15]:
len(enron_data.items()[0][1])
Out[15]:
21
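That assumption (same number of features for everyone) can be verified rather than assumed. A minimal sketch, using a toy dict in place of enron_data, written to run under both Python 2 and 3:

```python
# Toy data standing in for enron_data: two people, three features each.
toy_data = {
    'PERSON A': {'salary': 100, 'poi': False, 'bonus': 'NaN'},
    'PERSON B': {'salary': 'NaN', 'poi': True, 'bonus': 50},
}

# Collect the distinct feature counts; a single element means every
# record exposes the same number of features.
feature_counts = set(len(features) for features in toy_data.values())
print(feature_counts)  # {3}
```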

How many POIs are there?

In other words, count the number of entries in the dictionary where data[person_name]["poi"]==1

In [16]:
count = 0
for k,v in enron_data.iteritems():
    if v['poi'] == 1:
        count += 1
count
Out[16]:
18
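The same count can also be written as a one-liner, since booleans sum as integers in Python. A sketch on a toy dict standing in for enron_data:

```python
toy_data = {
    'PERSON A': {'poi': True},
    'PERSON B': {'poi': False},
    'PERSON C': {'poi': True},
}

# True counts as 1 when summed, so this counts the POIs directly.
poi_count = sum(v['poi'] for v in toy_data.values())
print(poi_count)  # 2
```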

Query the Dataset Further

Total value of stock belonging to James Prentice?

In [17]:
enron_data.get('PRENTICE JAMES',[])['total_stock_value']
Out[17]:
1095040

How many email messages do we have from Wesley Colwell to persons of interest?

In [18]:
enron_data.get('COLWELL WESLEY',[])['from_this_person_to_poi']
Out[18]:
11

What's the value of stock options exercised by Jeffrey K Skilling?

In [19]:
enron_data.get('SKILLING JEFFREY K',[])['exercised_stock_options']
Out[19]:
19250000

Research the Enron Fraud

Udacity recommends watching the documentary 'Enron: The Smartest Guys in the Room' before proceeding further. I used it as a reference.

Enron CEO during fraud: Jeffrey Skilling
Enron Chairman during fraud: Kenneth Lay
Enron CFO during fraud: Andrew Fastow

Follow the Money

Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of 'total_payments' feature)?

In [20]:
from operator import itemgetter

total_payments_dict = {}  # maps person -> total payments
poi_list = ['SKILLING JEFFREY K', 'LAY KENNETH L', 'FASTOW ANDREW S']
for each_person in poi_list:
    total_payments_dict.update( {each_person : enron_data.get(each_person,[])['total_payments']} )

max(total_payments_dict.iteritems(), key=itemgetter(1)) #ref: https://artemrudenko.wordpress.com/2013/04/12/python-finding-a-key-of-dictionary-element-with-the-highestmin-value/
Out[20]:
('LAY KENNETH L', 103559793)

Unfilled Features

How is it denoted when a feature doesn't have a well-defined value?

In [21]:
# testing
enron_data.get('SKILLING JEFFREY K',[])
Out[21]:
{'bonus': 5600000,
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'jeff.skilling@enron.com',
 'exercised_stock_options': 19250000,
 'expenses': 29336,
 'from_messages': 108,
 'from_poi_to_this_person': 88,
 'from_this_person_to_poi': 30,
 'loan_advances': 'NaN',
 'long_term_incentive': 1920000,
 'other': 22122,
 'poi': True,
 'restricted_stock': 6843672,
 'restricted_stock_deferred': 'NaN',
 'salary': 1111258,
 'shared_receipt_with_poi': 2042,
 'to_messages': 3627,
 'total_payments': 8682716,
 'total_stock_value': 26093672}

So the answer is 'NaN' (note: stored as the string 'NaN', not a float NaN).
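Because missing values are the string 'NaN' rather than a float NaN, comparisons must be against the string. A small sketch counting missing values per feature, again with a toy dict standing in for enron_data:

```python
from collections import Counter

toy_data = {
    'PERSON A': {'salary': 100, 'bonus': 'NaN', 'loan_advances': 'NaN'},
    'PERSON B': {'salary': 'NaN', 'bonus': 50, 'loan_advances': 'NaN'},
}

# Count, for each feature, how many people have the string 'NaN'.
nan_counts = Counter()
for person, features in toy_data.items():
    for name, value in features.items():
        if value == 'NaN':
            nan_counts[name] += 1

print(dict(nan_counts))  # {'bonus': 1, 'salary': 1, 'loan_advances': 2}
```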

Dealing with unfilled features

How many folks in this dataset have a quantified salary?

In [22]:
people_counter = 0  # count only those having quantified salary

for k,v in enron_data.iteritems():
    salary =  v['salary']
    #print salary
    if salary != 'NaN':
        people_counter += 1
people_counter
Out[22]:
95

How many folks in this dataset have a known email address?

In [23]:
email_counter = 0  # count only those with a known email address

for k,v in enron_data.iteritems():
    email =  v['email_address']
    #print email
    if email != 'NaN':
        #if '..' not in email: # apparently this is not a problem..
            email_counter += 1
email_counter
Out[23]:
111

Missing POIs 1

How many people in the E+F dataset (as it currently exists) have 'NaN' for their total payments? What percentage of people in the dataset as a whole is this?

In [24]:
from __future__ import division  # for python 2, sigh..
people_counter = 0  # count only those with unquantified payments, i.e. 'NaN'

for k,v in enron_data.iteritems():
    total_payments =  v['total_payments']
    #print total_payments
    if total_payments == 'NaN':
        people_counter += 1
print people_counter
print people_counter/total
21
0.143835616438

Missing POIs 2

How many POIs in the E+F dataset have 'NaN' for their total payments? What percentage of POIs as a whole is this?

In [25]:
from __future__ import division  # for python 2, sigh..
poi_nan_counter = 0  
poi_total_counter = 0

for k,v in enron_data.iteritems():
    total_payments =  v['total_payments']
    poi = v['poi']
    if poi == True:
        poi_total_counter += 1
        if total_payments == 'NaN':
            poi_nan_counter += 1
print poi_nan_counter
print poi_nan_counter/poi_total_counter
0
0.0

Yes, I double-checked: it is 0.

Missing POIs 3

If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a "NaN" value with POIs or non-POIs?

With non-POIs, because in this dataset only non-POIs have 'NaN' total payments, which means 'NaN' would become a signal the algorithm learns to associate with non-POIs.

On the other hand, every POI has quantified total payments (none have 'NaN'), so a 'NaN' value carries no information that points toward POIs.
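The asymmetry can be made concrete by comparing the 'NaN' rate within each class. A toy sketch (the counts below are illustrative, not the real dataset's):

```python
toy_data = {
    'POI A':     {'poi': True,  'total_payments': 1000},
    'POI B':     {'poi': True,  'total_payments': 2000},
    'NON-POI A': {'poi': False, 'total_payments': 'NaN'},
    'NON-POI B': {'poi': False, 'total_payments': 500},
}

def nan_rate(data, is_poi):
    # Fraction of the given class whose total_payments is the string 'NaN'.
    group = [v for v in data.values() if v['poi'] == is_poi]
    missing = [v for v in group if v['total_payments'] == 'NaN']
    return len(missing) / float(len(group))

print(nan_rate(toy_data, True))   # 0.0 -> no POI is missing payments
print(nan_rate(toy_data, False))  # 0.5 -> some non-POIs are
```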

Missing POIs 4

If you added in, say, 10 more data points which were all POI's, and put 'NaN' for the total payments for those folks, the numbers you just calculated would change.
What is the new number of people in the dataset? What is the new number of folks with 'NaN' for total payments?

In [26]:
# current no of people in dataset
people_counter =  len(enron_data)

# current no of folks with 'NaN' for total payments
nan_counter = 0
for k,v in enron_data.iteritems():
    if v['total_payments'] == 'NaN':
        nan_counter += 1

# 10 new POIs added, so 
people_counter = people_counter + 10
print 'new total: ' + str(people_counter)

# 10 new NaNs
nan_counter = nan_counter + 10
print 'new nans: ' + str(nan_counter)
new total: 156
new nans: 31

Missing POIs 5

What is the new number of POIs in the dataset? What is the new number of POIs with NaN for total_payments?

In [27]:
# current no of POIs
poi_counter = 0
for k,v in enron_data.iteritems():
    if v['poi'] == True:
        poi_counter += 1

# after new 10 pois
poi_counter = poi_counter + 10
print poi_counter

# since all earlier pois had quantified total_payments, new no of pois with NaN is 10
28

Missing POIs 6

Once the new data points are added, do you think a supervised classification algorithm might interpret 'NaN' for total_payments as a clue that someone is a POI?

Ans: Yes. Now some POIs have quantified payments and some have 'NaN', so a 'NaN' total payment becomes evidence that a person could be a POI as well.